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In regression with random design, we study the problem of selecting a model that performs 
well for out-of-sample prediction. We do not assume that any of the candidate models under 
consideration are correct. Our analysis is based on explicit finite-sample results. Our main 
findings differ from those of other analyses that are based on traditional large-sample limit 
approximations because we consider a situation where the sample size is small relative to the 
complexity of the data-generating process, in the sense that the number of parameters in a 
'good' model is of the same order as sample size. Also, we allow for the case where the number 
of candidate models is (much) larger than sample size. 
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1. Introduction 

Some of today's most challenging statistical problems feature a large number of poten- 
tially important factors or variables and a comparatively small sample size. For example, 
van't Veer et al. (2002) successfully use gene expression profiling to predict recurrence 
of breast cancer using a classifier comprised of 70 genes that are selected from a total of 
about 25 000 based on a sample of size 78; see also van de Vijver et al. (2002). In such 
applications, the goal of model selection is often not to find 'the correct model', but 
rather a model that performs well for prediction. Moreover, the number of explanatory 
variables (e.g., genes) in the selected model is often of the same order as sample size and 
the number of candidate models (e.g., subsets of genes) is much larger than sample size. 
We consider one problem of that kind: regression with random design, where the true 
model is allowed to be infinite-dimensional, and where the goal is to find a model with 
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'good' out-of-sample predictive performance.^ We focus on situations where the sample 
size is relatively small, in the sense that the number of parameters in a 'good' model 
is of the same order as sample size. We also allow for the case where the number of 
candidate models is (much) larger than sample size. To select a 'good' model, we esti- 
mate the performance of candidate models and select the one with the best estimated 
performance. 

1.1. Classical procedures 

Of course, model selection based on estimated predictive performance has already been 
extensively studied. Methods developed for that aim include the Sp criterion (which 
can be traced back to Tukcy (1967); sec also Hocking (1976), Thompson (1976a,1976b)); 
the Akaike information criterion (AIC; Akaike (1969)); the final prediction error criterion 
(FPE; Akaike (1970)); the Cp criterion of Mallows (1973); the generahzed cross-validation 
criterion (GCV; Craven and Wahba (1978)); or the small-sample corrected version of AIC 
(AICc; Hurvich and Tsai (1989)). Minimizing these criteria over a class of candidate 
models leads to a model selection procedure that is conservative (or over-consistent) in 
parametric settings.^ Alternatively, consistent model selection procedures can be used, 
including the Bayesian information criterion (BIC; Schwarz (1978)) or the minimum 
description length criterion (MDL; Rissanen (1978)). Further related methods include 
the prediction criterion (PC) of Amemiya (1980) and the risk inflation criterion (RIC) 
of Foster and George (1994). 

Existing performance analyses of these model selectors do not give a clear picture 
as to what method is preferable. Consider a so-called post-model-selection estimator, 
that is, an estimator obtained by first selecting a model based on the training data and 
then fitting the selected model to the same training data by a method like least-squares 
or maximum likelihood. In a parametric setting, Kempthorne (1984) showed that any 
post-modcl-selection estimator is admissible within the class of all post-model-selection 
estimators (for squared error in-samplc prediction loss). In large samples, it is well known 
that AIC and similar procedures are asymptotically efficient (in a certain sense) if the 
true model is infinite-dimensional, while BIC and related methods are efficient if the true 
model is finite-dimensional (cf. Shao (1997) and the references given therein). In finite 
samples, however, BIC can be more efficient than AIC (or vice versa), depending on 
sample size and on the unknown parameters, both in the parametric and the nonpara- 
metric cases (cf. Kabaila (1998)). Yang (2005) showed that one cannot find a procedure 

^ Here, 'out-of-samplo prediction' means prediction of new responses given hitherto unobserved ex- 
planatory variables, whereas 'in-sample prediction' means prediction of new responses for the same 
explanatory variables as in the training data. 

^ In a parametric setting, most model selectors can be broadly classified as either consistent or 
conservative: Consistent model selectors are such that the probability of selecting the most parsimonious 
correct model goes to 1 as sample size increases; conservative model selectors arc not consistent, but 
such that the probability of selecting an incorrect model goes to zero. 
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that combines the strengths of AIC and BIC. In the case where the true model is finite- 
dimensional and as sample size gets large, consistent model selectors choose the smallest 
correct model with probability approaching 1, while conservative ones do not; however, 
consistent model selectors also lead to unbounded worst-case risk, while the worst-case 
risk corresponding to conservative procedures typically stays bounded in large samples 
(cf. Leeb and Potscher (2005) as well as Leeb and Potscher (2008)). Hence, from the per- 
spective of existing performance analyses, one cannot prefer one of these model selectors 
over the other in general because the performance of a given model selector depends on 
unknown parameters and on sample size. 

1.2. New approaches 

In this paper, we adopt a different perspective that provides new results and insights. 
To explain, we first note that the aforementioned analyses that are based on asymptotic 
considerations rely on large-sample limit approximations that 'kick in' provided that 
the sample size is 'sufficiently large'; the precise meaning of 'sufhciently large' typically 
depends on the underlying true data-generating process and more complex processes 
usually require larger samples.'^ In practice, however, one often faces a very different 
situation, namely one where the given sample size is relatively small compared to the 
complexity of the data-generating process, for example, in the sense that the number of 
parameters in a 'good' model is of the same order as sample size. In addition, the number 
of candidate models is often (much) larger than sample size. Here, we adopt a framework 
that is specifically designed for such scenarios. 

We find that generalized cross-validation and Tukey's Sp criterion perform well in 
selecting a 'good' model, even if the candidate models are complex when compared to 
sample size and also if the number of candidate models is much larger than sample 
size. More specifically, we show that the true out-of-sample predictive performance of a 
candidate model is well approximated by generalized cross-validation (or by the objec- 
tive function of the Sp criterion) with high probability, uniformly over large classes of 
candidate models and uniformly over huge regions in parameter space under very weak 
conditions; for details, see Theorem 3.4 and Corollary 3.5. Moreover, we find that several 
other model selectors, including AIC and BIC, can be systematically defective when eval- 
uating complex models and that their performance can be anything from satisfactory or 
mildly suboptimal to completely unreasonable, depending on unknown parameters. (This 
is in stark contrast to the well-known result that generalized cross-validation and the Sp 
criterion are asymptotically equivalent to AIC - a result that holds asymptotically as 
the sample size gets large relative to the complexity of the data-generating process.) Our 
findings are based on explicit finite-sample results (cf. Theorem 3.2 and Corollary 3.3) 
and backed up by simulation examples. 

^ Here, the precise meaning of 'complexity' depends on the details of the approximation that is 
being considered. In many cases, 'complexity' is related to the number of parameters or to smoothness 
conditions. 
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Conceptually, our approach is inspired by Beran (1996), Beran and Diimbgen (1998) 
and Beran (2000). The setting in these papers is similar to ours, in that the number of 
explanatory variables is of the same order as sample size. However, these papers con- 
sider regression with fixed design and the focus is on estimating a different performance 
measure, namely on the Euclidean distance between the true location parameter and the 
estimate. In that setting, the performance of any estimator depends only on the estima- 
tor itself and on the unknown true regression parameter. In contrast, we focus on the 
out-of-sample predictive performance and we consider random design. In our setting, the 
out-of-sample predictive performance of any estimator, in addition to depending on the 
estimator itself and the regression parameter, also depends on the design distribution, 
which is unknown. (If the number of design variables under consideration is sufficiently 
small in relation to sample size, the empirical distribution of these design variables can be 
used as a proxy for the true design distribution. In the setting that we consider, however, 
this does not work because the number of design variables considered is not necessarily 
small relative to sample size.) In the setting of Beran (1996), a Cp- or AlC-like approach 
to loss estimation is shown to work well. In our setting, we find that Cp and AIC do not 
work well and that a different approach to performance estimation is required. 

A related direction of research was initiated by Barron, Birgc and Massart (1999) and 
further explored by Yang (1999) and Baraud (2002); see also Wegkamp (2003), as well 
as the references in these papers. Instead of attempting to estimate the performance 
of candidate models, these papers provide finite-sample upper bounds for the risk of 
post-model-selcction estimators that are based on minimizing an objective function like 
penalized maximum likelihood or penalized least-squares, where the risk is defined as the 
expected Euclidean distance between the true regression parameter and the estimator, 
or some similar (known) distance measures, as in Baraud (2002). Under some conditions 
and for Cp- or AlC-like penalty functions, these upper bounds give so-called oracle in- 
equalities, stating that the true risk of the post-model-selection estimator is within a 
constant multiple of the risk obtained by fitting the minimal-risk model. (Note that the 
upper bound provided by such an oracle inequality is not known in practice because 
it depends on the unknown regression parameter.) Our results differ from these in two 
important aspects. First, wc consider a different objective, namely minimizing the out-of- 
samplc prediction risk (where the performance of an estimator, in addition to depending 
on the estimator and on the true regression parameter, also depends on the unknown 
distribution of the design variables), and we focus on the case where the sample size is 
small relative to the complexity of the data-generating process, a case where Cp- or AIC- 
like objective functions do not perform well. Second, instead of giving upper bounds, we 
show that the performance of the resulting post-model-selection estimator can actually 
be estimated in our setting. We give finite-sample bounds on the estimation error prob- 
ability that depend only on quantities that are cither known or that can be estimated in 
a uniformly consistent fashion. 

Technically, our paper relics heavily on the results of Breiman and Frcedman (1983), 
who give a large-sample limit analysis of model selection by the Sp criterion for the special 
case of nested candidate models. A precursor version of our paper that was written in 
2005 was instead based on the Marcenko-Pastur law (cf. Marcenko and Pastur (1967)). 
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1.3. Outline of this paper 

For a sample of n observations from some data-generating process to be specified later, 
we consider a collection A4n of candidate models m G Ain with dimension |m|. (We use 
the symbols m to denote a candidate model and |to| to denote the number of explanatory 
variables in the model m.) We do not assume that the true regression function is correctly 
described or even well approximated by any of the candidate models. Under model to, 
the response is related to a collection of |to| explanatory variables. The leading case of 
interest is where the sample size is small relative to the complexity of the true data- 
generating process, a case where 'interesting' models are such that \m\/n is large, for 
example, \m\/n equals 0.1, 0.5 or even 0.9. We focus on the case where the number of 
candidate models, that is, #A^„, is as large as, or much larger than, sample size.* Our 
objective is to select a model that performs well for out-of-sample prediction, that is, 
for predicting a new response given hitherto unobserved explanatory variables. For a 
fixed set of new explanatory variables, the model that performs best for predicting the 
corresponding response can, and typically will, depend on the values of these explanatory 
variables (cf. Claeskens and Hjort (2003)). To identify a model that performs well in an 
overall sense, we consider random design and we evaluate a model's performance by the 
conditional mean squared error of the corresponding predictor, where the conditioning 
is on the training sample. In other words, we search for a model that, when fitted to 
the given training sample, performs well on average when repeatedly predicting new 
responses. (Of course, the case of random design also is a scenario of interest in its own 
right.) The conditional out-of-sample prediction error associated with model m is denoted 
by p'^{m) and we consider the generalized cross-validation criterion GCV(m) and Tukey's 
Sp criterion Sp{m), as well as an auxiliary criterion p^im) (that will be defined later), as 
estimators for p^{ni); see Section 2 for the details. We also consider other model selection 
criteria, namely the Akaike information criterion AIC(to), Hurvich and Tsai's AICc(to) 
and the final prediction error criterion FPE(to); as well as the Bayesian information 
criterion BIG (to). 

A theoretical analysis of the aforementioned criteria is given in Section 3, under the 
assumption that the data are sampled from a Gaussian distribution. We first give an 
explicit finite-sample analysis of the auxiliary criterion p^(rn) in Section 3.1. These results 
allow us to show that generalized cross-validation and the Sp criterion can be used to 
select a good model with high probability, uniformly over large families of candidate 
models and uniformly over huge regions in parameter space under very weak conditions; 
see Section 3.2. Finally, the performance of other model selectors is discussed in Section 
3.3. (On a technical level, the results in Section 3 rely heavily on the assumption of 
Gaussianity, but we suspect that similar findings might be obtained in more general 
settings and our simulation results appear to support this.) 

^Huber (1973) considers a related setting, where the dimension of the overall model, denoted by k, is 
finite, but increases with n such that k/n — > 0. He notes that settings where k/n and \m\/n are large "are 
unlikely to yield a reasonably simple asymptotic theory" (cf. page 802 of that paper). See also Portnoy 
(1984, 1985) and Mammen (1989). 
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The impact of our theoretical results is demonstrated in a simulation study in Section 
4, where we also consider non-Gaussian samples. Our simulations include examples where 
a sample of size 1 300 is used to select a model from over a million candidate models. 
We demonstrate that model selection by generalized cross-validation or the Sp criterion 
performs very well here and that the performance of these model selectors is basically 
unaffected by departures from normality. For the other model selectors that we consider, 
that is, for AlC(m), AlCc(m), FPE(m) and BlC(m), we find in the theoretical analysis 
and in the simulation examples that their performance can be anything from satisfactory 
or mildly suboptimal to completely unreasonable, depending on unknown parameters. 
The more technical parts of the proofs are given in the Appendix. 

2. Setting of the analysis 

Consider a response y that is related to explanatory variables x = {xj)°°^i by 

oo 

for some /3 = {/3j)j°^i ■ Throughout, we assume that the error u has mean zero and variance 
(7^ > 0, and that the (stochastic) sequence of explanatory variables x = {xj)°^i has mean 
zero and variance/covariance net S = [E(xiXj)]ij>i such that the series in (1) converges 
in squared mean. Moreover, we also assume that the explanatory variables Xj, j are 
each uncorrelated with the error u and that the Xj's are not perfectly correlated among 
themselves.^ The unknown parameters here are the sequence of regression coefficients, 
the error variance and the variance/covariance net of the regressors, that is, /3, and 
E. The (minimal) requirement, that the series in (1) converges in squared mean, restricts 

in a way that depends on E. For example, if S is such that the xj^s have variance 1 
and are uncorrelated with each other, then it is required that f3 that is, /?| < oo. 
(For the case where the explanatory variables arc not centered, extensions of the results 
in this paper are given by Leeb (2007).) 

Consider a sample of size n from (1). The sample will be denoted by {Y,X), where Y is 
the n-vector Y = (y(^), . . . ,y("))', X is the n x oo net X = (x^^)', . . . ,x(")')' and (yW,a;W) 
are independent and identically distributed copies of {y,x), as in (1). Let Pn,p.a,j: de- 
note the distribution of the sample {Y,X) and let En,i3M,T. denote the corresponding 
expectation operator. Similarly, wc write Var^_CT,s[2/] for the variance of y in (1).^ 

As estimators for /?, we consider restricted least-squares estimators corresponding to 
submodels of the overall model (1), under which some coefficients of (3 are restricted 

^In other words, we require, for each fc > 1 and integers ji < j2 < ■ ■ <■ jk i that (xj-^ , • • • , )' is a 
random vector with mean zero and positive definite variance / covariance matrix that is uncorrelated with 
u. 

®It should be noted that Pn, (3,17,1:, ^'n./S.tr.E and Var^.o-,E[y]i in addition to depending on the param- 
eters 13, a and S, also depend on the actual distribution of (y,x) in (1); this dependence is not reflected 
explicitly by the notation. The distribution of {y,x) will always be clear from the context. 
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to zero. Each such submodel can be identified by a 0-1 sequence m = {mj)"^^, where 
nij = if the jth coefficient of (3 is restricted to zero and nij = 1 otherwise; the number of 
unrestricted components, that is, the number of I's in m, is denoted by |m|. Throughout 
the paper, we shall always assume that |m| <n — l7 We call \m\ the order of the model 
m. The restricted least-squares estimator corresponding to the model m is denoted by 
/3(m) and is defined as follows: (3{m) is such that its jth component equals zero whenever 
rrij = 0; the \m\ remaining (unrestricted) components of /3(m) are obtained by regressing 
Y on the corresponding columns of X. 

Based on the sample {Y,X), our objective is to find a 'good' model for out-of-sample 
prediction. To this end, consider a new copy (j/^-^', a;^-'^-') of {y,x), as in (1), that is inde- 
pendent of {Y,X). Given a model m with |m| < n — 1 and the corresponding restricted 
least-squares estimator (3{m), we will use a;'^-* /3(m) as a predictor for y*--^'. To evaluate 
the performance of this predictor, we consider the conditional and unconditional mean 
squared prediction errors given by 

and 

respectively. For the conditional mean squared prediction error p^{m), note that the 
sample {Y,X) is kept fixed and the average is taken only with respect to {y^^\x^^^), 
so p^{m) is a function of I3{m) — j3. In particular, p^(rn) can become large if either the 
model is too complex (so that /3(m) is not close to P because of over- fit), or if important 
explanatory variables are not included in the model (so that /3{m) is not close to f3 because 
of under-fit). Also, note that p^{m) = Vainfi,cj,Y.[y^^^ — x^^'' /3(m)||F, X] here because the 
mean of x^^'^ is zero. For the case where the mean of x^^^ is not zero such that p'^{m) = 
Var„,e,,,s[y(^^ - x^f^ P{m)\\Y,X] + {E^,g,,,^[y^f^ - x^^^ p{m)\\Y,X]f , we refer to Leeb 
(2007): assuming that the sample is Gaussian and that the model includes an intercept, 
it is shown in that paper that the squared bias, that is, {En.e.<y,T.[y''^'^ — x^^^ j3{'m)\Y, X])^, 
is of smaller order than the variance, that is, Va.Yn.9,<j,Yi[y^^^ — x*-^^ /3(m)||y,X]. Our main 
focus will be on the conditional mean squared prediction error, that is, p^(m), rather 
than on the unconditional mean squared prediction error, that is, i?^(m), which is based 
on averaging over hypothetical samples. Also, note that p'^{m) depends only on n, /3, a 
and S; R^{m), on the other hand, also depends on the actual distribution of the random 
variables in (1). 

Remark 2. 1 . Instead of (m) or p^ (jn) , traditional large-sample limit analyses often 
consider error measures like the (unconditional) mean of {x'^^^' P - x'^^^' i3{m)Y , scaled by 
a parametric or nonparametric rate (depending on the setting). In a parametric setting, 
where the parameter (3 has only finitely many non-zero components, this is because the 

We assume |m| < n — 1 for the sake of convenience; some of our results also hold for \m\ < n, while 
others even hold for \m\ <n. 
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mean of (a;*^-^^ /3 — x^^'^ i3{m)Y goes to zero at a rate of 1/n, provided that the model 
m is correct. Similar considerations apply in nonparametric settings under smoothness 
conditions, provided that the dimension of the model increases appropriately with sample 
size. Considering the mean of (x'--^-' fi — x'^^^ (3{m))'^ or of (y*--'^-' —x^^'^ (3{m))'^ is equivalent, 
as far as selecting a 'good' model is concerned, because the two means differ by a fixed 
constant, namely the error variance . The lack of scaling by some rate in p^{m) and 
B?(rn) is caused by the fact that we do not assume a parametric model and we do 
not impose smoothness conditions in a nonparametric model because such assumptions 
would mean that estimation errors go to zero as sample size increases. Instead, we use 
approximations that retain the finite-sample feature that estimation errors are potentially 
large because the sample size is small relative to the complexity of the data-generating 
process. 

The conditional and unconditional mean squared prediction errors depend on unknown 
parameters and thus must be estimated. For a candidate model m, we consider the 
generalized cross-validation criterion GCV(m), the Sp criterion Sp{m) and an auxiliary 
criterion fP{m), which are defined as follows. Let 

GCV(m) = ( V")RSS(m) _ RSS(m) n 



(1 — \m\/n)^ n — |m| n — \m\ 

In the above display, RSS(to) denotes the residual sum of squares obtained by fitting the 
model m, that is, RSS(m) = X]r=i(2/^*^ ~ ^^'^ The generalized cross-validation 

criterion is closely related to the Sp criterion, which is defined by 

„ , . RSS(to) n — 1 
Sp[m) - 



n — m n — 1 — m 



For technical reasons, we also consider another quantity that is closely related to both 
GCV(m) and Sp{m), namely 



, RSS(to) n + 1 



p\m)- 



n — \m\ n -f- 1 — |m 



For most practical purposes, the difference between GCV(to), Sp{m) and fp'{m) is neg- 
ligible. (Also, note that GCV(m), Sp{m) and fP{m) are well defined because we always 
assume that \m\ <n — 1.) The other model selectors mentioned in the Introduction are 
defined in Section 3.3. 



3. Theoretical analysis 

In this section, we study the problem of estimating the conditional and unconditional 
mean squared prediction error in the case where the sample is drawn from a Gaussian 
distribution. We hence assume throughout this section that the random variables in (1) 
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are jointly normal.* Unless otherwise noted, we fix parameters /3, a and S as in (1) and 
consider a fixed sample size n and a fixed model m with |77i| < ?i — 1. For y and x as in 
(1), set 

a^{m) ^YaTf3,cr^T,[y\\xj GN,mj = 1]. 

Note that a'^(ni) is non-random because the involved random variables arc jointly Gaus- 
sian, and that a^{m) < (7^(0) = Va.rp^a-,^[y]- 



3.1. Finite-sample results 

The following result (whose first statement is adapted from Breiman and Freedman 
(1983)) provides the basis for a finite-sample analysis. 

Proposition 3.1. (i) The conditional mean squared prediction error p^im) has the same 
distribution as 1 plus the ratio oj two independent chi-squared random variables with \m\ 
and n — \ni\ + 1 degrees of freedom, respectively, multiplied by a'^{m): 



p^(m)-CT^(m) 1 + 



ill) The residual sum of squares has the same distribution as a chi-squared random 
variable with n~ \m\ degrees of freedom, multiplied by a'^{m): 



RSS(m)^cT2(m)xL 



Proposition 3.1 immediately implies that the unconditional mean squared prediction 
error B?{m) can be computed explicitly as 



because we always assume that |m| < n — 1 (recall the formula for the mean of the F- 
distribution). This also gives the well-known result that Sp{m) is an unbiased estimator 
for the unconditional mean squared prediction error R^{m); the estimators GCV(77i) and 
fP{m) for R^{m) are biased, but the bias is typically negligible. For \m\ < n — 3, we also 
get that the variance of p'^{m) is finite and given by 



{n — \m\ — l)^(n — |m| — 3) n (1 — |m|/n)^ 

We see that the conditional mean squared prediction error, that is, p^{m), is highly 
concentrated around its mean, that is, R^{m), provided only that n is large enough 

^Note that assuming the sample to be Gaussian entails that f'n,/3,CT,E a^nd E^ fj u -^, as well as 
Var^ ,j,e[2/]i a-re uniquely determined by the parameters in the subscript. 
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relative to a^{'m)/{l — \m\/n)^. This suggests that Sp{m), GCV(m) and p^{m) can be 
used to estimate not only (m) , but also the conditional mean squared prediction error 
p^{m). In order to use these considerations for model selection, we need to establish that, 
say, /P{m) is close to p'^{m) with high probability, not only for a fixed model to, but for 
an entire collection of candidate models. This is accomplished by the following theorem 
and the attending corollary. 

Theorem 3.2. For each e > 0, we have 



where ^(•) is defined by "^{x) = {x/{x + l))^/8 for x >0. (In the case a'^{m) = 0, the 
upper bound is to be interpreted as zero.) 

For fixed e > 0, the upper bound in Theorem 3.2 is of the form 4exp[— nC], where C is 
always positive. This upper bound is exponentially small in ti, provided only that |TO|/n 
is bounded away from 1 and a'^(m) is bounded away from infinity. The upper bound 
depends on the known quantities n, |m|/n and e, and also on a'^{m), which is unknown. 
However, recall that a'^{in) is bounded from above by cr2(0) = Varp^^.^iy], that is, by the 
variance of y in (1), which can be readily estimated from the sample, for example, by 
[n — ~ 5)^1 where y denotes the mean of the responses y^^\ ■ ■ ■ , y*-"-* in the 

training sample. Thus, we see that the upper bound in Theorem 3.2 is exponentially small 
in n, provided only that both the complexity of the candidate model and the variance 
of the response, that is, |to| and Var/3,(j,s[2/] , are not too large. These considerations, 
together with Bonfcrroni's inequality, immediately lead to the following result. 

Corollary 3.3. Consider a (finite and non-empty) collection Ain of candidate models 
and let rn = sup,„g^^ \m\/n. Then 



for each e > and for each (finite) c > 0. (Here, #A^„ denotes the number of candidate 
models and ^'(•) is as in Theorem 3.2.) 

The upper bound given in Corollary 3.3 is of the form 4exp[— nZ? + log^A^„], where 
the constant D > depends on r„ and c (for fixed e > 0). In particular, the upper bound 
is small provided only that the variance of the response, the complexity of the candidate 
models and the number of candidate models are not too large in relation to sample size. 



-Pn,/3,cr,s(|/5^(m) - p^{m)\ > e) 





< 4#X„exp[-n(l - r„)*((e/(2c))(l - r„))] 
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Clearly, results paralleling Theorem 3.2 and Corollary 3.3 can also be derived when 
either generalized cross-validation or the Sp criterion, that is, GCV(-) or Sp{-), are used 
instead of p^{-)- The reason for considering fP{-) here is that this estimator leads to the 
most simple and most revealing upper bound. (In most practical cases, the distinction 
between p^(m), GCV(m) and Sp{m) is negligible anyway. In the Appendix, we also give a 
variant of Theorem 3.2 with Sp{m) and R^{m) replacing p^{m) and p^im), respectively; 
see Proposition A. 5.) 

It should be noted that the upper bound in Theorem 3.2 does not go to zero as e goes 
to infinity. (The same also applies to the upper bound in Corollary 3.3, which is derived 
from that in Theorem 3.2.) The upper bound in Theorem 3.2 is, in fact, based on a 
tighter, but more complicated, bound that is given in Proposition A. 4 in the Appendix; 
that tighter bound docs go to zero as e — > oo. We present the bound of Theorem 3.2 as 
our main result because, for fixed e > 0, it captures in a simple expression the essential 
interplay between the sample size, the complexity of the candidate model and the data- 
generating process that guarantees that the probability of \p'^{m) — p^(m)| exceeding e 
is exponentially small in n. The tighter bound of Proposition A. 4 does the same, but is 
much more complicated. Moreover, the upper bound in Theorem 3.2 is tight enough to 
give the rates of convergence that are presented in the following section. 



3.2. Approximation results 

In this section, we provide conditions under which the upper bounds given previously 
go to zero. Under these conditions, p^{m), GCV(m) and Sp{m) are close to p^{m) with 
probability approaching one, uniformly over a collection of candidate models and uni- 
formly over a large region in parameter space. For the sake of simplicity, the results that 
follow simply state that estimation errors go to zero in probability at a certain rate, 
instead of giving explicit, but more complicated, finite-sample upper bounds. 

Theorem 3.4. For each sample size n, consider a (finite and non-empty) family A4„ 
of candidate models, let r„ =sup,„g^^ |m|/n and define a„ as 



l \0giffMn + l) 



Assume that a„ as n oo. Then 



sup \GCY{m)-p\m)\=Op{an) (3) 

holds uniformly over all data- generating processes as in (1) that satisfy Var^^cr.E[2/] ^ c 
(where c > is an arbitrary fixed (finite) constant). Hence, over the indicated set of 
parameters, GCV(to) is a uniformly 1/ On-consistent estimator for p^^m), uniformly in 
m G Ain- The same applies with Sp{m) or fp'^m) in place o/GCV(m). (These statements 
all continue to he true if (m) replaces p^ (m) .) 
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Informally, the condition that a„ — > maintained by Theorem 3.4 imposes two re- 
quirements on the complexity of the candidate models and on the number of candidate 
models, respectively, in relation to sample size: (i) that the candidate models are not too 
complex, that is, r„ is not too close to 1, so that n{l — rn)^ can get large; (ii) that the 
number of candidate models is not too large, in the sense that log#A^„ is of smaller 
order than n{l — r„)'^. The first requirement, that is, that r„ is not too close to 1, only 
rules out cases that are susceptible to severe over- fit anyway. The second requirement, 
that is, that log#A1„ = o(n(l — r„)^), rules out certain cases of complete subset selec- 
tion, for example, the case where #A^„ = 2". However, that requirement typically still 
allows for considerably large classes of candidate models. In practice, limited computa- 
tional resources will often entail much stronger restrictions on the number of candidate 
models that can be considered. The consequences of Theorem 3.4 for model selection are 
immediate. 

Corollary 3.5. In the setting of Theorem 3.4, assume that an -^0- Consider (measur- 
able) minimizers o/GCV(m) and p^{m) over Ain, 

m* = argminGCV(m) and m* argminp^(TO). 

(i) The empirically best model, that is, m^, is asymptotically as good as the best model, 
in the sense that 

-p^(m;)| =Op(a„), 

uniformly over all data-generating processes as in (1) that satisfy Var^^cr.s[2/] < c (where 
c > is an arbitrary fixed (finite) constant). 

(ii) The predictive performance of the model rn* can be estimated in a uniformly con- 
sistent fashion, in the sense that 

\GCYim:)-p^{m:)\=Opian), 

uniformly over all data- generating processes as in (1) that satisfy Var^^cr.s[2/] < c. 

The above continues to hold if, throughout, GCV(-) is replaced by Sp{-) or p^{-). (These 
statements continue to be true if R^{-) replaces p'^{-).) 

If Corollary 3.5 applies, the generalized cross-validation criterion (or, cquivalently, ei- 
ther the Sp criterion or fP{m)) can be used to select a good model whose estimated 
performance is close to its actual performance (with probability approaching 1), uni- 
formly over the indicated region in parameter space. That region in parameter space is 
characterized by an upper bound on Var^ o-^^iy], that is, on the variance of the response 
y in the overall model (1). Boundedness of the response's variance is a very innocuous 
restriction, showing that the performance of generalized cross-validation (or the related 
criteria Sp(rn) and p^{m)) is guaranteed over a huge region in parameter space: for 
example, fix cr^ and fix E such that the explanatory variables in (1) are uncorrelated 
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with unit variance; for c> a^, the condition Var^^cr,s[2/] c then requires that /3 satisfies 
J2j 0j ^c — a^, that is, /3 can range over a non-compact subset of h- 

For parameters f3, a and S satisfying Var^,(j_s[?/] < c, for a given model m and for a 
fixed sample size n, the conditional and unconditional mean squared prediction error can 
take on any value between between and Var^_cr.s[2/] (because the model m can contain 
anything between all and none of the non-zero coefficients of f3). By considering such 
parameters in Corollary 3.3, Theorem 3.4 and Corollary 3.5, we focus on situations where 
the noise u^-^"^ is not the dominant source of error when predicting y^^^ = x^J'^ (3 -\- u^^"^ 
out-of-sample. This captures scenarios where the sample size is small, relative to the 
complexity of the true data-generating process. Our results show that generalized cross- 
validation (or the Sp criterion or fP(m)) performs very well in such situations. 

3.3. Other model selectors 

It is instructive to compare generalized cross-validation and the Sp criterion to 
other model selection methods. We consider some classical examples, namely the 
Akaikc information criterion (AIC), the AIC with finite-sample correction (AICc) 
of Hurvich and Tsai, Akaike's final prediction error criterion (FPE), and Schwarz' 
Baycsian information criterion (BIC), whose objective functions are given by AlC(m) = 
n-iRSS(m)cxp(2|TO|/n), AICc(to) =n-iRSS(m)cxp(2(|m|-hl)/(n-|TO|-2)),FPE(7n) = 
n-iRSS(m)(l 4- \m\/n)/{l - \m\/n) and BIC(to) = n-iRSS(m)nl™l/", respectively. 
(Traditionally, AIC, AICc and BIC are defined on a logarithmic scale; the equiva- 
lent exponential scale used here is more convenient for our purposes. We also as- 
sume here that |m| < n — 2, to ensure that AlCc(m) is well defined.) Note that the 
objective functions of AIC, AICc, FPE and BIC are strictly increasing in RSS(m) 
and that the same is true for GCV(to). This allows us to express, say, AlC(m) as 
AlC(m) =GCV(m)e2|'"l/"(l - \m\/n)^, informally suggesting the following: The AIC- 
objective function AlC(m) is close to 

p2(m)e'l'"l/"(l- |m|/n)2; (4) 

the FPE-objective function FPE(m) is close to 

p^{m){l + \m\/n){l ~ \m\/n); (5) 

the objective function AICc(7n) is close to 

p2(m)e2(l"l+i)/("-l"l-2)(l - \m\/nf; (6) 

and BlC(m) is close to 

p^(m)nl™l/"(l-|TO|/n)2. (7) 

More formally, in the setting of Theorem 3.4 and provided that the quantity a„ defined 
there goes to zero, the diff'erences between AlC(m) and FPE(m) and the quantities in 
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(4) and (5), respectively, converge to zero uniformly in m G A4n, where convergence 
is in probability, uniformly over the set of parameters satisfying Var^_CT,E[?/] £ c with 
c > 0. (For AIC, this immediately follows from Theorem 3.4 because the events where 
|AIC(to) - p2(r7^)e2|™l/"(i _ \m\/n)^\ > e and where |GCV(m) - p^{m)\ > ee-2|™l/"(l - 
|m|/n)~^ coincide and are contained in the event where |GCV(m) — p^(m)| > ee~^; for 
FPE, a similar argument applies.) The same is true for AlCc(m) and (6), under the 
additional assumption that limsup„r„ < 1, as well as for BlC(m) and (7), under the 
additional assumptions that limsup„r„ < 1 and n''"(l — r„)^a„ — > 0, as is easily seen. 

To see how AIC, AICc, FPE and BIC perform compared to generalized cross-validation 
and the Sp criterion, first consider the case where the number of explanatory variables 
is of the same order as sample size, that is, the case where \m\/n is not close to zero. In 
that case, (4)-(7) suggest that AlC(m), FPE(to), AlCc(m) or BlC(m) will not give a 
good estimator for p'^{m) or R^ijn). Whenever |m| > 1, the expressions (4) and (5) are 
always smaller than p^{m); hence, AlC(m) and FPE(to) tend to underestimate p^irn). 
Similarly, for |77i| > 1, the expressions in (6) and (7) are always larger than p'^{m), so 
AlCc(m) and BlC(m) tend to overestimate p^{m). More importantly, these criteria will 
typically not select a model with small (conditional or unconditional) mean squared 
prediction error because the minimizcrs of p^(m) or R^{m) over m G A4n typically differ 
from the minimizcrs of (4), (5), (6) or (7). Hence, if the sample size is small relative to 
the complexity of the true data-generating process, such that p^(m) is minimized by a 
model m with jml/n not close to zero, then the objective functions of AIC, AICc, FPE 
or BIC can give a distorted picture of that model's performance, both in absolute terms 
and relative to other candidate models. These model selectors cannot be guaranteed to 
choose a good model in that situation. 

It should be kept in mind that AIC(to), AlCc(m), FPE(m) and BlC(m) do not, in 
fact, primarily aim to estimate the out-of-sample mean squared prediction error p^(m) (or 
R^{m)). For example, AlC(m) is derived from an estimator of the Kullback-Leibler dis- 
crepancy between the true and the fitted in-samplc predictive distribution; that estimator 
is asymptotically unbiased, provided model m is correct. Further, BlC(m) is derived from 
a first-order expansion of the posterior probability of model m in a Bayesian framework. 
In certain asymptotic settings where the sample size is typically much larger than the 
parameters in the model (and for an appropriately chosen class of candidate models), a 
model minimizing the AIC or the BIC objective function also performs well for prediction 
out-of-samplc in the limit. But, if the number of explanatory variables in the candidate 
model is not small compared to sample size, this correspondence can break down, as we 
see here. Similar considerations apply, mutatis mutandis, to AlCc(m) or FPE(m); see 
Leeb and Potscher (2008) for further details. 

It remains to consider the case where the number of explanatory variables is of smaller 
order than sample size. We consider this case for completeness, even though it is not the 
main focus of this paper. This case is typical for traditional (parametric or nonparamct- 
ric) large-sample settings, where the sample size is (much) larger than the complexity 
of the underlying data-generating process so that it can be described by a model that is 
relatively simple compared to sample size. If jml/n is small, it is easy to see that the ob- 
jective functions GCV(m), Sp{m), p^{m), AIC(to), AlCc(m) and FPE(to) are essentially 
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equivalent as estimators for p^im) or E?{m)\ the same is also typically true for BlC(m), 
provided that |m|/n is small enough, in view of (7). In typical parametric settings, this is 
reflected by the fact that, with probability approaching 1 as sample size increases, these 
objective functions are minimized by correct models only. However, if |m|/n is small and 
the simple model m is a good approximation to the true data-generating process, then 
the noise variance is the dominating factor in both (m) and (m) . To distinguish 
between model selection methods here, it is common to consider other performance mea- 
sures like the (conditional or unconditional) mean of /3 — x''^^ j3{'m))'^ or variants 
thereof, where rh is the model minimizing the objective function under consideration, 
for example. AIC(-) or BIC(-); sec also Remark 2.1. In such a comparison, and in the 
large-sample limit, BIG is typically found to perform differently from the other model 
selectors considered here. But, as outlined in the Introduction, the relative efficiency of 
the post-modcl-selection estimators obtained by, say, AIC and BIG, respectively, depends 
crucially on unknown parameters and on sample size, to the extent that either one can 
be more efficient than the other. We suspect that in such settings, post-model-selection 
estimators, which can be viewed as 0-1-shrinkage-type estimators, are too crude to per- 
form well in general and that methods based on smooth shrinkage arc preferable. This is 
demonstrated by Goldenshluger and Tsybakov (2003), who propose a smooth shrinkage 
estimator that is shown to be asymptotically minimax for out-of-sample prediction over 
Sobolev balls. 

4. Numerical results 

In this section, we investigate the performance of model selectors in finite samples by 
simulation, where we consider the Gaussian well as several non-Gaussian cases. 

We focus on 'hard' problems, where the number of parameters is large compared to 
sample size. We stress that these examples are meant for the purposes of demonstration 
and should not be mistaken for an exhaustive finite-sample simulation analysis. The 
simulation results are shown in Figures 2-4, and are explained in the following subsection. 
For each of three different scenarios introduced below, we consider one fixed realization 
of X and Y (the set of explanatory variables and the response vector, resp.). Given 
a collection of candidate models that will be chosen later, we compare the estimated 
performance of each model m, that is, GGV(to); with its actual performance, that is, 
p^(m); see the solid black curve and the solid gray curve, respectively, in Figures 2-4. In 
addition, the figures also show how AIG(m), AIGc(m), FPE(to) and BIG(m) evaluate the 
models. We have repeated the simulations for other realizations of X and Y; the results 
were essentially unchanged. (The R-code used for the simulations is available from the 
author on request, together with the results of additional simulation nms.) 

4.1. Three simulation scenarios 

In each of the three scenarios, the explanatory variables Xj, j >1, and the error u in 
(1) are taken as mutually independent with mean zero and variance 1 so that E is 
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the identity and cr^ = 1 here. For the actual distribution of the explanatory variables 
and the error, we consider three distributions - normal, exponential and Bernoulli - 
each scaled and centered to have mean zero and variance 1. We consider each of these 
three distributions for the explanatory variables and for the error, resulting in a total 
of nine combinations, for example, the x^'s are i.i.d. normal and u is normal in (1), 
the Xj^s are i.i.d. (recentered and rescaled) exponential and u is normal in (1), etc. The 
case where all random variables in (1) are Gaussian has been analyzed in Section 3 
from a theoretical perspective. The (recentered and rescaled) exponential and Bernoulli 
distributions arc considered because they are very different from the Gaussian, that is, 
highly non-symmetric and discrete, respectively Our simulation results are essentially 
unaffected by departures from normality. In particular, the results from each of the nine 
combinations of distributions are visually indistinguishable from each other in graphs 
like Figures 2-4. In these figures, we therefore only report the results for the case where 
the explanatory variables are i.i.d. (rescaled and recentered) exponentials and where the 
error is standard normal. (The results for the other eight combinations are available from 
the author on request.) 

We now describe the three scenarios underlying Figures 2-4; the scenarios differ in the 
sample size, in the class of candidate models considered and in the underlying regression 
parameter. For each scenario, we choose the parameters so that the problem is hard, in 
the sense that there is a rather large number of acceptable models (i.e., models m such 
that p'^{m) is close to mmrneM„ P^("i)) and in the sense that the acceptable models are 
rather complex. 

For the results in Figure 2, the sample size is n = 700 and we consider leading-term 
submodels, that is, all models m of the form m ~ (1, . . . , 1, 0, . . .) with |m| = 0, . . . , 600; 
this gives a collection of 601 candidate models. The first 600 coefficients of (3 (in ab- 
solute value) are depicted in the top panel of Figure 1; the remaining coefficients 
of (3 are set equal to zero. The parameter /3 is such that the 'signal-to-noise' ratio 
(Var„^/3.o-,s[j/] — cr'^y^^ /cr equals five; the same also applies to the parameters chosen for 
Figures 3 and 4. (If the 'signal-to-noise' ratio is chosen too small, only very parsimonious 
models perform well; increasing the 'signal-to-noise' ratio has the opposite effect. Consis- 
tent with the focus of this paper, we have chosen a 'signal-to-noise' ratio between these 
two extremes.) For Figure 2, the parameter (3 is chosen such that its first 600 components 
are arranged in 'approximately decreasing' order, while the remaining components are 
zero. This scenario is meant to reflect a situation where some prior knowledge is available 
that allows one to arrange the coefficients of /3 by decreasing importance such that the 
consideration of leading-term submodels is appropriate. Because such prior knowledge is 
typically incomplete, the coefficients are only approximately ordered (in absolute value) 
here. The results of this simulation are summarized by the black and gray curves in 
Figure 2. Black curves depend only on the data like, for example, GCV(to), while gray 
curves also depend on the parameters /3, S and cr, like, for example, p^(m). The black 
curves show GCV(to), AIC(7ti), AICc(7ti), FPE(m) and BlC(m) for each of the 601 can- 
didate models m ordered by |m|. For better readability, the points are joined by lines. 
The minimum of each of these black curves is indicated by a solid dot with the name 
of the objective function next to it. The black curves have corresponding gray curves. 
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The gray curves corresponding to GCV(m), AlC(m), FPE(?n), AlCc(m) and BlC(m) 
are given by p'^{ra) and by the expressions in (4)-(7), respectively. For reference, the 
coefficients of /3 (in absolute value) are also plotted at the bottom of Figure 2, with a 
separate axis on the right. 

For the second scenario, which is shown in Figure 3, we take n = 1300 and the param- 
eter (3 is such that only its first 1000 components are non-zero. The non-zero coefficients 
of P are 'sparse', in the sense that most of them are rather small (but non-zero), while 
a few groups of adjacent coefficients are large (cf. the middle panel in Figure 1). Here, 
we choose a collection of candidate models that can pick-out the few groups of large 
coefficients. We divide the first 1000 coefficients of (3 into 20 consecutive blocks of length 
50 each and consider all candidate models that include or exclude a block at a time, 
resulting in 2^^ candidate models. With more than a million candidate models, we do 
not compute GCV(to) for each model under consideration. Instead, we search through 
model space using the obvious greedy general-to-specific strategy: fit the 'overall' model 
containing all 20 blocks and eliminate that block whose elimination leads to the smallest 
increase in the residual sum of squares (this results in a model containing 19 blocks); 
now, proceed inductively until all blocks have been eliminated. This procedure gives a 
data-driven sequence of 20 models of increasing complexity and a corresponding data- 
driven blockwise rearrangement of the coefficients of /3. (The investigation of alternative 
search strategies that are potentially superior to the greedy general-to-specific approach 
is beyond the scope of this paper.) The middle panel of Figure 1 shows the coefficients of 
(3 (in absolute value) in their original order. At the bottom of Figure 3, the coefficients 
are rearranged as described above. The description of the curves is as for Figure 2. 

For Figure 4, that is, the third scenario, we consider exactly the same setting as for 
Figure 3, the only exception being that the coefficients of (3 are here not 'sparse' (see the 
bottom panel in Figure 1). This exemplifies a situation where the collection of candidate 
models is inadequate for the (unknown) regression parameter. 

4.2. Discussion 

In the setting of Figure 2, the approximations developed in Sections 3.2 and 3.3 for 
the Gaussian case have clearly 'kicked in': GCV(to) is very close to the conditional 
mean squared prediction error p^(rn), uniformly over the class of candidate models. 
Also, AlC(m), FPE(m), AlCc(m) and BIC(to) are close to the quantities in (4)-(7), 
respectively. Only GCV(to) gives an accurate indication of the models' performance; 
the other objective functions do not properly reflect the (relative) performance of the 
various candidate models. The model minimizing GCV(to) is very close to the model 
minimizing the conditional mean squared prediction error (m) and the performance of 
that model is well approximated by the generalized cross-validation criterion. Also, the 
model minimizing AlCc(m) performs well. This, however, is more of a coincidence than 
a feature, as it is very easy to find a scenario where AlCc(m) docs not perform well; see 
Figure 4. 
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Figure 1. Starting from the top, the panels show the absolute values of the non-zero coef- 
ficients of the regression parameter /3 used for the simulation results in Figures 2, 3 and 4, 
respectively. In each case, the parameters are such that the the 'signal-to-noise' ratio is five, 
that is, (Var„_/3,CT,E[y] — (7^)^''^/(7 — 5 with ct = 1. 
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700 observations, exhaustive search over €01 candidate models 
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Figure 2. Results for the first simulation example. The black curves show GCV(m) (solid), 
AlC(m) (long dashed), FPE(m) (dotted), AlCc(m) (short dashed) and BlC(m) (dot-dashed); 
the minimum of each of these curves is indicated by a black dot with the name of the model 
selector next to it. The gray curves show p^im) (solid), as well as the expressions in (4)-(7) 
(long dashed, dotted, short dashed and dot-dashed, resp.). 

1300 observations, greedy search over 1048S76 candidate models 
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Figure 3. Results for the second simulation example. Definition of curves as in Figure 2. 
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In the settings of Figures 3 and 4, GCV(to) still provides a reasonably good approx- 
imation to p^{m), but the approximation is less accurate and GCV (to) tends to under- 
estimate p'^{m) for the more complex candidate models. That GCV(to) is less accurate 
as an approximation to p^{m) is due to the fact that the number of candidate models 
is much larger here than in the setting of Figure 2. In particular, the number of candi- 
date models is three orders of magnitude larger than the sample size here. (Recall that 
by partitioning the coefficients of j3 into blocks of length 50, we obtain 2^" candidate 
models. Decreasing the block size results in even more candidate models and in deterio- 
rating accuracy; increasing the block size has the opposite effect.) The phenomenon that 
GCV(to) tends to be smaller than p^{m) is caused by the nature of the greedy search 
through model space which, in each step, eliminates that block of parameters that results 
in the smallest increase of the residual sum of squares. 

The results in Figure 3 show that GCV(m) continues to perform reasonably well, even 
if the number of candidate models is much larger than sample size. Again, GCV(to) 
is close to p^(to) and the model minimizing GCV(to) performs similarly to the overall 
best candidate model, the minimizer of p^(m). And, as before, the other objective func- 
tions do not properly reflect the models' performance. Here, it happens that the models 
minimizing BlC(m) and AICc(to) also perform well, but, again, this need not be the 
case in general (BIC performs poorly in Figures 2 and 4, and AICc performs poorly 
in Figure 4). The model minimizing GCV(to) by using the greedy general-to-specific 
search through model space performs remarkably well here. For comparison, consider the 
following (infcasible) procedure: reorder the coefficients of /3 such that their absolute val- 




Figure 4. Results for the third simulation example. Definition of curves as in Figure 2. 
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ues are decreasing and reorder the columns of X accordingly; after reordering, consider 
leading-term submodels similarly to the setting of Figure 2 and choose the model for 
which the conditional mean squared prediction error is minimized. The performance of 
that model is indicated by the unlabeled extra tick mark on the vertical axis of Figure 3. 
The performance of the (feasible) procedure that minimizes GCV(m) by a greedy search 
is remarkably close to that of the infeasible method just described. 

In Figure 4, the largest model with all 1000 coefficients performs best (in terms of 
conditional mean squared prediction error). This is a situation where the unknown pa- 
rameter is such that none of the lower-dimensional models perform well. Here, the models 
minimizing AlC(m) and FPE(m) perform very well and the model minimizing GCV(m) 
performs comparably, but slightly worse. As before, only GCV(m) gives a reasonable 
indication of the models' actual performance, while the other objective functions do not. 
The models minimizing BlC(m) and AlCc(m) do not perform well here. 

It is striking that the results in Figures 2-4 arc basically unaffected by the underlying 
distribution of the explanatory variables and of the error term in (1). For each of the 
nine combinations of distributions for the explanatory variables and for the error that we 
considered, the results are visually indistinguishable from those shown in Figures 2-4. 
Although our theoretical analysis in Section 3 applies only to the Gaussian case, our 
simulation results suggest that our main findings are hardly affected by departures from 
normality, at least in the examples considered here. 

Appendix: Proofs 
A.l. Auxiliary results 

The first two lemmas in this section are derived using Chernoff's method, or variants 
thereof. 

Lemma A.l. Let A and B he independent random variables distributed as xi '^'^'^ Xb' 
respectively, with a, 6 G N. For each e > 0, we then have 




and 




if e < a/h, 
otherwise. 



The function IC{-,-) is given by 



/C(r,c) = (1 + r)log 



1 + r + c 
1 + r 



rlog 



r + c 



r 



for r > and c> —r. 
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Proof. For < t < 1/2, we have 



P{A/B - a/b >e)= P{A > B{e + a/b)) 

= P{e^p{tA) > exp{tB{e + a/b))) 

= S[P(exp(tA) > cxp{tB{e + a/b))\\B)] 



< E[cxpi-tB{e + a/b))il - 21)-"/^] 
= {l + 2t{e + a/b))-''^\l-2t)-''^\ 



where the inequahty is based on Markov's incquaUty. the moment generating function 
of the x^-distribution and the fact that A and B arc independent. Rewrite the above 
inequahty as 



with f{t) = log(l + 2t{e + a/b)) + (a/b) log(l - 2t). It is elementary to verify that f{t) is 
maximized over < i < 1/2 at = {l/2)e/ {{e + a/b){l + a/b)) and that /(i*) = K.{a/b,e). 
(Note that /(•) is twice continuously differentiable on (0,1/2); solving f'{t) =0 gives 

G (0, 1/2), as above. Because /"(■) is negative, i.e., because /(•) is concave, on (0, 1/2), 
/(•) attains its maximum at t^,.) This gives the first inequality. 

The second inequality is trivial in the case a/b < e. For the case a/b > e, we have 



for each t satisfying < t < l/(2(a/6 — e)), by a similar argument as used above. Again, 
write the inequality in the above display as P{A/B — a/b < — e) < exp(— (&/2)g(t)) and 
note that g{t) is maximized at = {l/{2{a/b — e))){e/{a/b + 1)) which satisfies < < 



Lemma A. 2. Let B be distributed as xl with 6 G N. For each e > 0, we then have 



P(^/B-a/5>e) <e-(''/2)/(*) 



P{A/B - a/b < -e) = P{B{a/b - e) > A) 

^ P{cxp{tB{a/b - e)) > cxp{tA)) 

< (1 - 2t{a/b - e))"''/'(l + 20""/' 



l/(2(a/6-£)), and that ,9(4) = /C(a/6, -e). 



□ 




and 





The function C{-) is given by 



£(c)=c- log(l + c) 



for c > — 1 . 
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Proof. For the first inequality, fix t satisfying < t < 1/2 and note that 



P{B/b - 1 > e) = P{B > 6(1 + e)) 

= P{exp{tB) > exp{tb{l + e))) 
<c-*f(i+^)(l_2t)-^/2 



(as in the proof of Lemma A.l, we use Markov's inequahty and the moment generating 
function of B here). The inequality in the above display can be written as P{B/b— 1 > 
e) < exp(-(6/2)/(t)) with f{t) = 2t(l + e) + log(l - 2t). The function /(■) is maximized 
over (0,1/2) at (£/2)/(l + e) because /'(t,) = and f"{t) < for < t < 1/2. 
Observing that /(t*) equals C{e) gives the first inequality. 

For the second inequality, assume that e < 1 (the other case being trivial). An ar- 
gument similar to that used in the preceding paragraph gives that P{B /b — 1 < — e) < 
exp(-(6/2).g(t)) with g{t) = -2t{l - e) + log(l + 2t) for t > 0. It is elementary to verify 
that g{-) is maximized for i > at = e/(2(l — e)) and that .g(t*) = £(— e). □ 

Lemmas A.l and A. 2 give finite-sample analogs to well-known large deviation results. 
In particular, a result of Killeen, Hettmansperger and Sievers (1972) entails, in the no- 
tation of Lemma A.l, that 



(In the above relation, b is required to go to infinity and a/b is required to converge to a 
limit r £ (0, oo). That relation follows from Example 5.1 of Killeen, Hettmansperger and Sievers 
(1972) upon expressing A/B — a/b as a linear fimction of an i<"-distributed random vari- 
able.) In finite samples, the first upper bound in Lemma A.l gives 



Similar considerations apply, mutatis mutandis, to the upper bounds given in Lemma 
A. 2. (In view of Theorem 1 of Chernoff (1952), that is obvious because of the way these 
upper bounds are constructed.) 

Lemma A. 3. Fix r > and consider the functions /C(-, •) and C{-) defined in Lemmas 
A.l and A. 2, respectively. 

(i) For c satisfying < c <r, we have 



moreover, for c satisfying < c < 1, the above relation continues to hold with £(•) re- 
placing lC{r,-). 





c) < /C(r, — c); 
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(ii) For each c> 0, the functions JC{r,c) and C{c) are related by 



(iii) The function C{-) is increasing on [0, oo); for c satisfying < c < 1, C{c) satisfies 

4</:(c). 

Proof. For part (i), assume first that < c < r. We need to show that /C(r, c) < /C(r, — c) 
or, equivalently, that 

s,l + 7' + c ,r + c 

(l + r)log— <rlog . (8) 

l + r — c r — c 

Setting f{u, v) = ulog((u + v)/{u — v)), the relation in (8) is equivalent to /(r + 1, c) < 
f{r,c). Clearly, this relation is satisfied for c = 0. With this, it suffices to show that 
df{r + l,v)/dv < df{r, v)/dv for < < c. Now, 



dv {u + v){u — v) 

To derive (8), it remains to observe that df{u,v)/dv is decreasing in u for r < u < r + 1 
because 

d'^f(u,v) , uv"^ 
dvdu ivP' — w^)^ 

for 0<'i;<c, c<r and r < u < r + 1 . 

To complete the proof of part (i), assume that < c < 1. We need to show that C{c) < 
£(— c) or, equivalently, that 

2c + logi— ^<0. 

1 + c 

Write g{c) for the expression on the left-hand side of the above inequality. Clearly, g{Q) < 
0. That g(c) < also holds for < c < 1 follows upon observing that g'{c) — — 2c^/(l — c^) 
is negative for < c < 1 . 

For part (ii), write /C(r,c) as /C(r,c) = h{r + 1) — h{r) with h{r) = rlog((r + c)/r). It 
is elementary to verify that h(-) is increasing and concave on [0, oo): for r > and c > 0, 
we have 

h\r)^logil + c/r)- " 



r + c 

= -(iog(i-r^)+^)>o 
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and 



h"{r) 



r{c + r) 




and this entails that /C(r,c) > h'{r + l) = C{-c/{r + 1 + c)) > C{c/ {r + 1 + c)), where the 
last inequality follows from part (i). 

For part (iii), observing that jC'(c) = 1 — 1/(1 + c) shows that £(•) is increasing on 
[O,cxo). The lower bound for C{c) is trivial in the case c~Q. For < c < 1, the lower 
bound follows upon observing that C'{c) > c/2 because c/2 — £'(c) = c(c — l)/(2(c + 1)) 
is negative for such c. □ 

A. 2. Proofs of main results 

Proof of Proposition 3.1. In the case where m is of the form m ~ (1, . . . , 1, 0, . . .), the 

statement in (i) is equivalent to Breiman and Freedman (1983), Theorem 1.3 (provided 
that the quantity p in that theorem is set to p= |m| ; also, note that the quantity Mn^p de- 
fined in Breiman and Freedman (1983) then coincides with the conditional mean squared 
prediction error p'^(m) considered here). For general m (with |7n| < n), note that reorder- 
ing the explanatory variables (and reordering the components of (3 conformably) does 
not change the conditional mean squared prediction error. Hence, Breiman and Freedman 
(1983), Theorem 3.1 also gives the distribution of p^{m) for general to. 

The following preliminary consideration is required to derive the second part of the 
proposition. Throughout the following, fix a candidate model m £ M. Recall the linear 
model (1) and write z for the |TO|-vector of those explanatory variables Xj that are in- 
cluded in the model to. Because y and z are jointly Gaussian, the conditional distribution 
of y given z is again a Gaussian. Because both y and z have mean zero, the conditional 
mean of y given z is a linear function of z. Recalling that the conditional variance of y 
given z is (T^(m), we see that y\\z ^ N{z'd,a'^{m)) for an appropriate |m|-vector 6. In 
other words, (1) can be rewritten as 



with V ~ A^(0, (T^(m)) independent of z. 

To prove the statement in (ii), write Z for the n x |m| matrix of those explanatory 
variables in the sample that are included in the model to. Conditional on Z, it follows from 
(9) and the attending discussion that RSS(to), that is, the residual sum of squares from 
regressing Y on Z, is distributed as cr^(TO)x^_|^|. Because this conditional distribution 
does not depend on Z, the imconditional distribution of RSS(77i) coincides with the 
conditional distribution. □ 

The proof of our main result, that is. Theorem 3.2, rests on the following proposition, 
which gives an upper bound for Pn,i3.a,s{\p^{'m) — /9^(to)| > e) that is tighter, but more 
complex, than the bound given in Theorem 3.2. 



y — z'O + V 



(9) 
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Proposition A. 4. In the setting of Theorem 3.2 and for each e > 0, the probability 
Pn,i3 ,cr,'s{\p^ {'ITT') ~ > e) IS not larger than Bi + B2 + -B3 + -B4. Here, the quantities 

Bi and B2 are defined as 



Bi = exp 
B2 = cxp 



n + 1 — |7T!,| 

2 

n — \m\ 



/C(H/(n+l-|m|),e/(2a2(m))) 



C{ie/{2a\m))){n + 1 - \m\)/{n + 1)) 



in the case a'^(m) > and as Bi = B2 ^ otherwise, where the functions IC{-, •) and C{-) 
are as in Lemmas A.l and A. 2, respectively. The quantity B3 is set equal to zero in the 
case e/(2(T^(m)) > |m|/(n + 1 — |m|); otherwise, B3 is defined as Bi with — e replacing e. 
Finally, the quantity B4 is set equal to zero in the case e/(2cr^(m)) > (ri + l)/(n+l — |m|); 
otherwise, B4 is defined as B2 with — e replacing e. 

Proof. In the case (T^(to) = 0, both p'^{m) and p^{m) are equal to zero with probabihty 
1, m view of Proposition 3.1. Hence, the statement of the proposition is trivial in that 
case. Assume, now, that cr^(m) > 0. The probability of interest, that is, Pn.p,<7.s{\p^{'n^) — 
fP' (m) I > e) , is bounded from above by 



n,/3.(T,S 



n+1 



n — |m| + 1 
n+1 



>e/2 



n — m + 1 



p (m) 



>e/2 



(10) 



Let E, F and G denote independent, ^^-distributed random variables with |to|, n~ \m\ + 
1 and n — |m| degrees of freedom, respectively. Using Proposition 3.1, the probabilities 
in (10) can be reexpressed in terms of E, F and G. Simplifying the resulting expressions, 
we see that (10) is equal to 



P 



E \m\ 
F n — |m| + 1 

G 



P 



n — \m\ 



1 



> 



> 



2ct2(to) 

e n — \m\ + 1 
20-2 (m) n + 1 



(11) 



To complete the proof, we need to show that (11) is not larger than Bi+ B2 + B3 + B^. 
The first term in (11) can be bounded from above using Lemma A.l. In particular, using 
that lemma with E, F, \m\, n + 1 — |m| and e/{2a'^{m)) replacing A, B, a, b and e, 
respectively, we see that the first term in (11) is bounded from above by Bi +-B3. For 
the second term in (11), we use Lemma A. 2 with G, n~ \m\ and (e/ {2a'^{m)))((n — \m\ + 
l)/(n+ 1)) replacing B, b and e, respectively, and obtain that the second term in (11) is 
bounded by B2 + B4. □ 
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Proof of Theorem 3.2. Because the case a'^{m) = is trivial, we assume that a'^{m) > 
0. In view of Proposition A. 4, it suffices to show that Bi + B2 + + B4 is not larger 
than the upper bound given by Theorem 3.2. 

First, consider the sum of Bi and B^. By Lemma A.3(i), Bi + B3 is bounded by 
2Bi. Set r* = \m\/{n + 1 - |m|) and c* = e/{2a^{m)) so that 2Bi 2exp[-((ri + 1 - 
|m|)/2)A^(r*, c*)]. Now, using Lemma A.3(ii) with r* and c* replacing r and c, respec- 
tively, we see that 2Bi, and hence also Bi + B3, is bounded by 



2exp 



1 - \m\ 



C 



1 + c* 



The lower bound for £(•) provided by Lemma A.3(iii) entails an upper bound for the 
expression in the preceding display. Simplifying the resulting bound and recalling that 
^'(•) was defined by 'I'(.t) = {x/{x + l))^/8, we see that 



Bi+B3<2 exp 



-{n + 1- ImD* 



20-2 



1 - 



n + 1 



Note that the right-hand side of the above inequality increases if n -I- 1 is replaced by n. 

For the sum of B2 and B4, Lemma A.3(i) shows that B2 + B4 is bounded by 2B2 or, 
more explicitly, by 



2 exp 



n — \m\ 



20-2 (to) 



1 



Again using Lemma A.3(iii), we get 



B2 + B4<2 exp 



-{n - |m|)* 



1 - 



2(T2(m) \ n+1 



The upper bounds for Bi + B^ and B2 + B4 obtained above immediately entail the 
upper bound given in Theorem 3.2, completing the proof. □ 

The following result gives upper bounds for P„^/3_o^^s(|S'p(m) — i?2(m)| > e) that parallel 
those for Pn.i3,a.'s{\p'^{m) — p^{m)\ > e) given in Proposition A. 4 and Theorem 3.2. 



Proposition A. 5. In the setting of Theorem 3.2 and for each e > 0, the probability 
Pn,i3.a,'>2{\Sp{m) ~ R^{m) \ > e) is not larger than Ci+ €2- Here, Ci is defined as 



Ci = exp 



n — \m\ 



C{{e/a^{m)){n-l-\m\)/{n~l)) 



in the case a^{m) > and as Ci = otherwise. The quantity C2 is set equal to zero in 
the case t/a'^{m) > (n — l)/{n — 1 — \m\); otherwise, C2 is defined as Ci, but with — e 
replacing e. The upper bound Ci + C2 is, furthermore, bounded from above by 



2 exp 



-(n - |m|)\E' 



n-1 
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where ^'(•) is as in Theorem 3.2. 



Proof. As in the proof of Proposition A. 4, the case a'^{m) = is trivial and we as- 
sume that (7^{m) > 0. Using Proposition 3.1 and the formulas for Sp{m) and B?{m) 
given in Sections 2 and 3, respectively, we see that the probability of interest, that is, 
Pn,f3,a,T,{\Sp{rn) — B?{m) \ > e), can be written as 



P 



G 



n — \m\ 



- 1 



> 



(t2(to) 71—1 



where G denotes a random variable that is x^-distributcd with n — |m| degrees of freedom. 
Let e* = 2e(n+ l)(ri— |to| — l)/((n— l)(n— |m| + l)). Then the expression in the preceding 
display coincides with the second term in (11) if, in that second term, e is replaced by e*. 
In the proof of Proposition A. 4, we have seen that the second term in (11) is bounded by 
i?2 + B4 (where Bi, i = 2,4, are as in Proposition A. 4). Using the formulas for B2 and 
B4 with e* replacing e, we obtain the upper bound Ci + C2. To complete the proof, we 
use the upper bound for B2 + B^ obtained in the proof of Theorem 3.2, replace e by e* 
and simplify. □ 

Proof of Corollary 3.3. The result follows upon noting that 



Pn,t3,a,^{ sup 1/5 (to) -p (m)| > e 
< J2 Pn.p,aM\P^M-p^im)\>e) 



(A.12) 



m£M„ 

<4#7\/(„exp 



-^(l-'^n)*(^^(l-r„)^ 



Here, the first inequality is Bonferroni's inequality; the second inequality follows from 
Theorem 3.2 upon noting that the upper bound in that theorem increases in |to|/?i < r„ 
and in (t^(to) < Var/3^cr,s[y] < c. □ 



Proof of Theorem 3.4. The plan of the proof is as follows. We first show that the result 
holds with p^{in) replacing GCV(to) and then that it holds with Sp{m) and R^{m) 
replacing GCV(to) and p'^{m), respectively. Finally, we show that sup^^^^ \ fp'{ra) — 
GCV(to)| and sup^g^^ \j!p'{m) — Sp{m)\ are both Op(a„), uniformly over the set of 
parameters where Var^_cr,s[y] £ c For later use, we note that a„ — > implies that n(l — 
rnY -^00 for k e {0,1,2,3} because > log2/(n(l - r„)3) > \og2/{n{l ~ r,,)''). 

To show that (3) holds with p^{m) replacing GCV(to), assume that /3, a and E 
satisfy Var^^cr,s[j/] < c and fix if > for the moment. By Corollary 3.3, we see that 
^n,e,cr,E(sup^g^^ \p^{m) — /0^(m)| > a„if) is bounded from above by 

4exp[-n(l - r„)«'(ifa„(l - r„)/2c) + log(#M„)]- 
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Because ^'(a:) = {x/{x + l))^/8, the exponent in the preceding display simphfies to 

log(#il/ + l) , , „ ^ 

<-log(#^„4-l)(^-l 

where the inequahty holds for sufficiently large n, that is, n > n(K); here, n{K) is chosen 
such that (ifa„(l — r„)/2c+ f ) < 2 for n > n{K) (that such n{K) exists follows from 
On — *■ 0). Hence, limsup„ sup^ ^ Pn,e ,a,Y.(s.\xp„^^j^^ |p^(m) — p^(m)| > OnK) can be made 
arbitrarily small by choosing K sufficiently large, where the suprcmum is taken over all 
/3, tr and S satisfying Var^_(j_s[y] < c. This shows that (3) holds with p^{m) replacing 
GCV(m). 

That (3) holds with Sp{m) and R^{m) replacing p^{m) and p'^{m), respectively, follows 
from an argument similar to that used in the preceding paragraph, now using Proposition 
A. 5 and Bonferroni's inequality instead of Corollary 3.3. 

We next show that supj„g^^ \p^{'^) — GCV(m)| ~ Op{an), uniformly over the indi- 
cated set of parameters; this will entail (3). Let G be a random variable that is x^- 
distributed with n — \m\ degrees of freedom, let /3, a and E be such that Var^,cr,s[y] < c 
and (lx K > for the moment. We then have 

Pn,f3,.M\p''im) - GCV(m)| > OnK) 

_ pf G OnK {n- \m\ + l){n~ \m\) 

\7i — |m| (T^(to) \m\ 

in view of Proposition 3.1(ii) and the fact that both p^im) and GCV{m) are linear 
functions of RSS(m). Because \m\/n < 1 and a'^irn) < Var^.CT.s[2/] < c, the expression in 
the above display is bounded from above by 



( G a„K , ,T \ 

\n— \m\ c J 



P( ^ , -l>-v/log(.M» + l)n(l-r„)-l) <p( ^ I -1>1 ), 
\n — |m| c I I „ i™i I 

where the equality follows by plugging in the formula for a„, and where the inequality 
holds for sufficiently large n, that is, n > n{K); existence of such n{K) follows from 
log(A^„ + 1) > log(2) and from n{l — r„) oo. The probability of interest, that is, 
-fn,/3,(T,s(|/5^("^) — GCV(m)| > OnK), is thus bounded from above by 

p( ^ - 1 > 1^ < e-(("-|™l)/2)-C(l) < e-n(l-r„)£(l)/2 
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for n > n{K), where the first mequahty follows from Lemma A. 2. Arguing as in (A. 12), 
this inequality entails that 

Pn.p,a.J sup \p^m) -GCV {m)\>a^K] < exp[-7i(l - r„)/:(l)/2 + log#>J„] 

\meM„ J 

for n > n{K). The exponent in the upper bound can be written as 

-n{l - r„)[/:(l)/2 - (log#A/„)/(n(l - r„))]. 

The expression in the above display goes to — oo because n(l — r„)^oo, >C(l)/2>0 and 
log#7W„/(n(l-r„))<a2 ^0. 

Finally, that sup^g^^ |/5^(to) — Sp{'m)\ = Op{an), uniformly over the indicated set of 
parameters, is established by arguing as in the preceding paragraph, but now using Sp{m) 
in place of GCV(to). □ 

Proof of Corollary 3.5. To derive part (i), note that p^{m^) — p^{'m^) is boimded 
from below by zero and from above by 

[p^m:,) - GCV(jK)] + [GCV(to,:) - GCV(m,:)] + [GCVK* ) - p^m:)]. 

By Theorem 3.4, the first and the last term in the above display are Op(a„), uniformly 
over the set of parameters satisfying Var^^^^x;[2/] < c. Because the middle term in the above 
display is non-positive, the statement in part (i) follows. Part (ii) is a direct consequence 
of Theorem 3.4. 

That parts (i) and (ii) continue to hold with Sp{-) or p^{-) replacing GCV(-) and also 
with i?^(-) replacing p'^{-) follows by repeating the argument in the preceding paragraph 
with the corresponding replacements. □ 
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